Fake News Detection

1 Introduction

In the era of digital media, the spread of misinformation has become a significant concern. Fake news can have devastating consequences, influencing public opinion and undermining trust in institutions. This project tackles the critical task of building a fake news detection model using a comprehensive dataset of labeled news articles. By applying machine learning and natural language processing techniques, I aim to identify patterns and characteristics that distinguish true news from fake news and to create a reliable model that accurately classifies news articles, ultimately contributing to a more informed and trustworthy online environment.

2 Setup

Imports

Show the code
%reload_ext autoreload
%autoreload 1
from lime.lime_text import LimeTextExplainer
import torch
from datetime import date
from sklearn.metrics import precision_score, recall_score
import polars as pl
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import nltk
from transformers import BertTokenizer
from auxilary import display_functions as disf
import joblib
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import plotly.io as pio
import plotly
from sklearn.metrics import confusion_matrix
from IPython.display import clear_output
from sklearn.metrics import f1_score, classification_report
import numpy as np
from auxilary.dl_helper import predict_probabilities
import mlflow
from model.data_loader import NewsDataModule
from transformers import AutoTokenizer
%aimport auxilary.display_functions
%aimport auxilary.dl_helper

Options

Show the code
BASE_FIG_SIZE = (8.5, 5)
STOPWORDS.add("s")
STOPWORDS.add("u")
pio.renderers.default = "plotly_mimetype+notebook_connected"

Loading the Data

Show the code
true_headlines, fake_headlines = joblib.load("data/clean_data.joblib")
merged_data = pl.concat(
    [
        fake_headlines.with_columns(pl.lit(1).alias("is_fake")),
        true_headlines.with_columns(pl.lit(0).alias("is_fake")),
    ]
)

3 Data Cleaning

The original dataset included 44,898 headlines.

The data cleaning procedures can be found in the cleaning.ipynb notebook. Data cleaning involved:

  • Stripping the source. Every true text began with a source-indicating string of the form "LOCATION (Reuters) -". To avoid over-fitting on this particular string, it was removed from the true news texts.

  • Removing extra spaces. All true texts had an extra whitespace character at the end, which was removed.

  • Removing items with bad dates. Some fake items had links instead of dates; these items were removed (10 items).

  • Removing headlines with links. Over 3000 fake headlines contained links, compared to only 2 true headlines. To avoid over-fitting on this particular feature, headlines with links were removed.

  • Removing headlines with no intelligible text. Using the langdetect module, 673 fake headlines and one true headline were detected as not English. These texts contained single words or unintelligible text and were therefore removed.

  • Dropping duplicates. 225 true and 4592 fake duplicated headlines were removed.

  • Removing very short headlines. 226 fake headlines with fewer than 20 words were dropped.

Data points left after cleaning:

Show the code
len(merged_data)
35847

4 Exploratory Data Analysis

4.1 Data Distribution

Example Fake Items:

Show the code
disf.table_display(fake_headlines.head(2))
title text subject date language
WATCH: Kellyanne Conway Pathetically Begs People To Buy Ivanka Trump’s Products On Fox News Donald Trump is clearly using the office of the presidency to enrich himself and his family, a clear violation of the Constitution.In January, the retailer Nordstrom informed Ivanka Trump of their decision to drop her product line from their stores. A boycott of Ivanka s products and any retailer who sells them contributed to the decision. That s the way the free market works.But after the decision became public earlier this week Trump lost his shit over it on Wednesday.My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person always pushing me to do the right thing! Terrible! Donald J. Trump (@realDonaldTrump) February 8, 2017Nordstrom responded by announcing that the decision was made purely for business reasons and that Ivanka was informed last month and understood the choice they made. Since then, Nordstrom s stock has risen and other retailers have also dropped Ivanka s products.And so, Donald Trump sent Kellyanne Conway on Fox News on Thursday to give free advertising to Ivanka s business. They are using the most prominent woman in Donald Trump s life, Conway began. Using her, who has been a champion for empowerment of women in the workplace, to get to him. I think she s gone from 800 stores to a thousand stores, or a thousand places where you can buy. You can buy her goods online, she continued before literally telling Fox viewers to go buy her products. Go buy Ivanka s stuff is what I would tell you. I hate shopping, I m going to go get some myself today. This is just wonderful line. I own some of it. I fully I m going to give a free commercial here. Go buy it today, everybody. You can find it online. Here s the video via YouTube.Seriously. Kellyanne Conway actually shilled for Ivanka s products during an interview. This is not only an unconstitutional conflict of interest, it s completely unfair. 
Trump is literally using his platform to advertise his daughter s business, a business she was supposedly going to walk away from because she joined Trump s White House team. Clearly, she lied just like her daddy did.The fact that Conway appeared on Fox to advertise for the Trumps demonstrates that Donald Trump only sees the presidency as a way to enrich himself and his family. He doesn t give a damn about the economy or foreign policy or any other problems we face. He only cares about personal profit and this is proof of his selfishness.Featured Image: Yana Paskova/Getty Images News 2017-02-09 00:00:00 en
“LIBERAL BULLY” Middle School Teacher Tells Students 13 Yr. Old Black Conservative’s “Not Worth Saving in a Fire” [VIDEO] Outspoken conservative CJ Pearson hasn t heard from the White House and doesn t expect to receive an invitation A teacher at Columbia Middle School in Evans, Georgia allegedly told his students that outspoken 13-year-old black conservative CJ Pearson was not worth saving in a fire, and that he hated him. This is just the latest example of how liberals seem to believe that hate speech is acceptable so long as it s directed at those on the right and especially minority conservatives.Via Paul Joseph Watson at Infowars:Pearson previously made headlines after his Twitter account was blocked by President Obama s official Twitter account following a video in which Pearson criticized Obama over his response to the Clock Kid controversy. White House officials also made fun of the teenager.Pearson was told by several other students in his class that teacher Michael Garrison said CJ is not worth saving in a fire and that he hates him. The teacher also accused Pearson of cheating on a vocabulary test when he was in sixth grade, a claim that Pearson denies. It s always great having a teacher that s not only a liberal bully, but someone who engages in slander, Pearson told BizPac Review. My words are bold and I don t expect everyone to agree. But to have a teacher say this about me? Completely inexcusable. School principal Eli Putnam has promised a full investigation into the matter. Pearson accuses the teacher of violating the school s bullying policy. Via: Gateway PunditHere s conservative CJ Pearson asking Barack Obama: Does every Muslim that can build a clock gets a presidential invitation? left-news 2015-10-18 00:00:00 en

Example True Items:

Show the code
disf.table_display(true_headlines.head(2))
title text subject date language
Zambian president urges unity as government, opposition prepare for talks Zambian President Edgar Lungu on Friday called for unity among political groups ahead of talks between the government and the opposition aimed at reconciliation after a political crisis earlier this year. The leader of the opposition United Party for National Development (UPND), Hakainde Hichilema, was arrested with five others in April and charged with plotting to overthrow the government after his convoy failed to make way for Lungu s motorcade. The case stoked political tensions in Zambia, a major copper producer and seen as one of Africa s more stable and functional democracies, following a bruising election last year. Hichilema was freed from prison in August after the state dropped the charges, to pave the way for dialogue between the two sides following mediation by Commonwealth Secretary-General Patricia Scotland. Scotland s special envoy Ibrahim Gambari is in Zambia and has separately held talks with Lungu, Hichilema and other opposition leaders. In an address at the opening of the national assembly, Lungu said Zambians could disagree and quarrel but would always remain one. The factors that unite us are much greater than those that seek to divide us, he said. Opposition UPND members of parliament, who boycotted Lungu s last address, attended Friday s session, saying their attendance would give confidence to the process of dialogue. The UPND MPs took this decision in the interest of the country in view of the forthcoming political dialogue, their spokesman Jack Mwiimbu said in a statement. worldnews 2017-09-15 00:00:00 en
Dozens of unidentified bodies found near Libyan city of Benghazi The bodies of 37 unidentified people have been found near the eastern Libyan city of Benghazi, security sources said on Friday. The bodies were found on Thursday night in Al-Abyar, about 70 km (44 miles) east of Benghazi. The security sources gave no information about their possible identity. Smaller numbers of bodies have been found in and around Benghazi on several occasions in recent months. The area is controlled by the Libyan National Army (LNA), a force headed by eastern-based commander Khalifa Haftar. He declared victory in a campaign for Benghazi in July, though some fighting has continued in one district of the city. worldnews 2017-10-27 00:00:00 en

Class counts:

Show the code
fig_data_count, ax_data_count = plt.subplots(figsize=BASE_FIG_SIZE)
sns.barplot(
    y=[len(fake_headlines), len(true_headlines)], x=["Fake", "True"], ax=ax_data_count
)
ax_data_count.set_ylabel("Count")
plt.show()

Figure 1: Class balance of news items in the full dataset after cleaning.

In total, there are more fake news items in the dataset than true news items.

Headlines by subject:

Show the code
fig_subjects, ax_subjects = plt.subplots(1, 2, figsize=BASE_FIG_SIZE)
sns.countplot(fake_headlines["subject"], ax=ax_subjects[0])
sns.countplot(true_headlines["subject"], ax=ax_subjects[1])
ax_subjects[0].set_title("Fake News")
ax_subjects[1].set_title("True News")
plt.tight_layout()
plt.show()

Figure 2: News item subjects in the dataset.

The subject categories did not overlap between the fake and true headlines, so the subject column was not used.

4.1.1 Temporal distribution

The temporal distribution of fake and true headlines in the dataset:

Show the code
true_headline_weekly = true_headlines.group_by(pl.col("date").dt.round("1w")).count()
fake_headline_weekly = fake_headlines.group_by(pl.col("date").dt.round("1w")).count()
fig_dates = make_subplots(rows=2, cols=1, shared_xaxes=True)
fig_dates.add_trace(
    go.Bar(
        x=true_headline_weekly["date"], y=true_headline_weekly["count"], name="True"
    ),
    row=1,
    col=1,
)
fig_dates.add_trace(
    go.Bar(
        x=fake_headline_weekly["date"], y=fake_headline_weekly["count"], name="Fake"
    ),
    row=2,
    col=1,
)

highlight_intervals = [
    {
        "start": date(2016, 1, 1),
        "end": date(2017, 8, 31),
        "color": "rgba(255, 255, 0, 1)",
        "name": "Training",
    },
    {
        "start": "2017-09-01",
        "end": "2017-10-31",
        "color": "rgba(0, 255, 0, 1)",
        "name": "Validation",
    },
    {
        "start": "2017-11-01",
        "end": "2018-02-19",
        "color": "rgba(255, 0, 0, 1)",
        "name": "Test",
    },
]

fig_dates.update_layout(
    shapes=[
        dict(
            type="rect",
            xref="x",
            yref="paper",
            x0=interval["start"],
            x1=interval["end"],
            y0=0,
            y1=1,
            fillcolor=interval["color"],
            opacity=0.25,
            layer="below",
            line_width=0,
        )
        for interval in highlight_intervals
    ],
)

for interval in highlight_intervals:
    fig_dates.add_trace(
        go.Scatter(
            x=[interval["start"], interval["end"]],
            y=[0, 0],
            mode="lines",
            line=dict(color=interval["color"], width=5),
            name=interval["name"],
            showlegend=True,
        ),
        row=1,
        col=1,
    )

fig_dates.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        bgcolor="rgba(255, 255, 255, 0.5)",
    ),
)

fig_dates.show(renderer="notebook")

Figure 3: Temporal distribution of the news items.

  • There are significantly more true headlines in the latter portion of the dataset.
  • There are some fake headlines from a time period that has no true headlines.

Based on these points the data was split into three sets:

  • Training: from 2016-01-01 to 2017-08-31
  • Validation: from 2017-09-01 to 2017-10-31
  • Testing: from 2017-11-01 to 2018-02-19
Show the code
train_data = merged_data.filter(
    pl.col("date").is_between(date(2016, 1, 1), date(2017, 8, 31))
)

val_data = merged_data.filter(
    pl.col("date").is_between(date(2017, 9, 1), date(2017, 10, 31))
)

test_data = merged_data.filter(pl.col("date") >= date(2017, 11, 1))

Class distribution in split data sets:

Show the code
fig_split_distr, ax_split_distr = plt.subplots(1, 3, figsize=BASE_FIG_SIZE, sharey=True)
for i, (name, data) in enumerate(
    {"Training": train_data, "Validation": val_data, "Testing": test_data}.items()
):
    sns.countplot(data["is_fake"].replace({0: "True", 1: "Fake"}), ax=ax_split_distr[i])
    ax_split_distr[i].set_title(name)
plt.show()

Figure 4: Class balance of news items in the split datasets.

4.2 Feature comparison

4.2.1 Word Occurrences

Calculating word frequency:

Show the code
true_tokens = word_tokenize(" ".join(true_headlines["text"]))
true_tokens = [word.lower() for word in true_tokens if word.isalpha()]
true_tokens = [word for word in true_tokens if word not in stopwords.words("english")]
true_word_frequencies = Counter(true_tokens)

fake_tokens = word_tokenize(" ".join(fake_headlines["text"]))
fake_tokens = [word.lower() for word in fake_tokens if word.isalpha()]
fake_tokens = [word for word in fake_tokens if word not in stopwords.words("english")]
fake_word_frequencies = Counter(fake_tokens)

true_word_frequencies = (
    pl.DataFrame(true_word_frequencies)
    .transpose(include_header=True, header_name="word", column_names=["count"])
    .sort("count", descending=True)
)
fake_word_frequencies = (
    pl.DataFrame(fake_word_frequencies)
    .transpose(include_header=True, header_name="word", column_names=["count"])
    .sort("count", descending=True)
)

true_word_frequencies = true_word_frequencies.with_columns(pl.lit(0).alias("is_fake"))
fake_word_frequencies = fake_word_frequencies.with_columns(pl.lit(1).alias("is_fake"))

word_frequencies = pl.concat([true_word_frequencies, fake_word_frequencies])

word_frequencies = word_frequencies.filter(pl.col("count") > 100)

word_frequencies = word_frequencies.with_columns(
    pl.when(pl.col("is_fake") == 0)
    .then(pl.col("count") / len(word_frequencies.filter(pl.col("is_fake") == 0)))
    .otherwise(pl.col("count") / len(word_frequencies.filter(pl.col("is_fake") == 1)))
    .alias("per_headline")
).sort("per_headline", descending=True)

words_not_in_fake = []
for word in word_frequencies["word"]:
    if word not in word_frequencies.filter(pl.col("is_fake") == 1)["word"]:
        words_not_in_fake.append(word)

words_not_in_true = []
for word in word_frequencies["word"]:
    if word not in word_frequencies.filter(pl.col("is_fake") == 0)["word"]:
        words_not_in_true.append(word)

word_frequencies = pl.concat(
    [
        word_frequencies,
        pl.DataFrame(
            {
                "word": words_not_in_fake,
            }
        ).with_columns(
            pl.lit(0, dtype=pl.Int64).alias("count"),
            pl.lit(1).alias("is_fake"),
            pl.lit(0, dtype=pl.Float64).alias("per_headline"),
        ),
        pl.DataFrame(
            {
                "word": words_not_in_true,
            }
        ).with_columns(
            pl.lit(0, dtype=pl.Int64).alias("count"),
            pl.lit(0).alias("is_fake"),
            pl.lit(0, dtype=pl.Float64).alias("per_headline"),
        ),
    ]
)


joblib.dump(word_frequencies, "temp/word_frequencies.joblib")
['temp/word_frequencies.joblib']
Show the code
word_frequencies = joblib.load("temp/word_frequencies.joblib")

Word cloud comparison:

Show the code
fig_clouds, ax_clouds = plt.subplots(2, 2, figsize=BASE_FIG_SIZE)
ax_clouds = ax_clouds.flatten()
wordclouds = {}
wordclouds["Fake Text"] = WordCloud(stopwords=STOPWORDS).generate(
    " ".join(nltk.word_tokenize(" ".join(fake_headlines["text"][:5000]).lower()))
)
wordclouds["True Text"] = WordCloud(stopwords=STOPWORDS).generate(
    " ".join(nltk.word_tokenize(" ".join(true_headlines["text"][:50000]).lower()))
)
wordclouds["Fake Title"] = WordCloud(stopwords=STOPWORDS).generate(
    " ".join(nltk.word_tokenize(" ".join(fake_headlines["title"][:5000]).lower()))
)
wordclouds["True Title"] = WordCloud(stopwords=STOPWORDS).generate(
    " ".join(nltk.word_tokenize(" ".join(true_headlines["title"][:5000]).lower()))
)

for i, cloud_key in enumerate(wordclouds.keys()):
    ax_clouds[i].imshow(wordclouds[cloud_key])
    ax_clouds[i].axis("off")
    ax_clouds[i].set_title(cloud_key)
plt.tight_layout()
plt.show()

Figure 5: Word cloud of the most common words in true and fake headline texts and titles.

The most common words overall seem similar in both true and fake news. These mainly include names of politicians and locations.

Bi-gram cloud comparison:

Show the code
fig_clouds_2gram, ax_clouds_2gram = plt.subplots(2, 2, figsize=BASE_FIG_SIZE)
ax_clouds_2gram = ax_clouds_2gram.flatten()
wordclouds_2gram = {}
wordclouds_2gram["Fake Text"] = disf.generate_ngram_wordcloud(
    " ".join(fake_headlines["text"][:5000]).lower(), 2
)
wordclouds_2gram["True Text"] = disf.generate_ngram_wordcloud(
    " ".join(true_headlines["text"][:5000]).lower(), 2
)
wordclouds_2gram["Fake Title"] = disf.generate_ngram_wordcloud(
    " ".join(fake_headlines["title"][:5000]).lower(), 2
)
wordclouds_2gram["True Title"] = disf.generate_ngram_wordcloud(
    " ".join(true_headlines["title"][:5000]).lower(), 2
)
for i, cloud_key in enumerate(wordclouds_2gram.keys()):
    ax_clouds_2gram[i].imshow(wordclouds_2gram[cloud_key])
    ax_clouds_2gram[i].axis("off")
    ax_clouds_2gram[i].set_title(cloud_key)
plt.tight_layout()
plt.show()

Figure 6: Word cloud of the most common bi-grams in true and fake headline texts and titles.

A similar tendency is seen for bi-grams.

4.2.2 Word occurrence difference

Finding words with the largest count difference:

Show the code
common_words_both = (
    word_frequencies.filter(pl.col("is_fake") == 1)["word"][:15].to_list()
    + word_frequencies.filter(pl.col("is_fake") == 0)["word"][:15].to_list()
)


frequency_compare = word_frequencies.filter(pl.col("is_fake") == 0).join(
    word_frequencies.filter(pl.col("is_fake") == 1), on="word", suffix="_fake"
)

frequency_compare = frequency_compare.with_columns(
    (pl.col("per_headline") - pl.col("per_headline_fake")).alias("diff")
)

Common words that are more common in true headlines:

Show the code
fig_different_words, ax_different_words = plt.subplots(figsize=BASE_FIG_SIZE)
sns.barplot(
    word_frequencies.filter(
        pl.col("word").is_in(
            frequency_compare.sort(pl.col("diff"), descending=True)["word"][:10]
        )
    ),
    x="per_headline",
    y="word",
    hue="is_fake",
    ax=ax_different_words,
)
plt.show()

Figure 7: Common words that occur more frequently in true news.

Common words more frequent in fake headlines:

Show the code
fig_different_words_fake, ax_different_words_fake = plt.subplots(figsize=BASE_FIG_SIZE)
sns.barplot(
    word_frequencies.filter(
        pl.col("word").is_in(frequency_compare.sort(pl.col("diff"))["word"][:10])
    ),
    x="per_headline",
    y="word",
    hue="is_fake",
    ax=ax_different_words_fake,
)
plt.show()

Figure 8: Common words that occur more frequently in fake news.

Saving the most common differentiating words:

Show the code
words_common_fake = frequency_compare.sort(pl.col("diff"))["word"][:10]
words_common_true = frequency_compare.sort(pl.col("diff"), descending=True)["word"][:10]

4.3 Text Length Analysis

Token count comparison:

Show the code
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
fake_headlines = fake_headlines.with_columns(
    fake_headlines["text"]
    .map_elements(lambda x: len(tokenizer.encode(x, add_special_tokens=True)))
    .alias("text_token_count")
)

fake_headlines = fake_headlines.with_columns(
    fake_headlines["title"]
    .map_elements(lambda x: len(tokenizer.encode(x, add_special_tokens=True)))
    .alias("title_token_count")
)

true_headlines = true_headlines.with_columns(
    true_headlines["text"]
    .map_elements(lambda x: len(tokenizer.encode(x, add_special_tokens=True)))
    .alias("text_token_count")
)

true_headlines = true_headlines.with_columns(
    true_headlines["title"]
    .map_elements(lambda x: len(tokenizer.encode(x, add_special_tokens=True)))
    .alias("title_token_count")
)

clear_output()

fig_token_hist, ax_token_hist = plt.subplots(2, 1, figsize=BASE_FIG_SIZE, sharex=True)
sns.histplot(true_headlines["text_token_count"], ax=ax_token_hist[0], binwidth=25)
sns.histplot(fake_headlines["text_token_count"], ax=ax_token_hist[1], binwidth=25)
ax_token_hist[0].axvline(512, linestyle="--", c="red")
ax_token_hist[1].axvline(512, linestyle="--", c="red")
ax_token_hist[0].set_xlim(0, 2000)
plt.show()

Figure 9: Distributions of token counts of headline texts when tokenized with BERT's uncased tokenizer. The red line indicates 512 tokens (BERT's context window limit).

Notably, fake headlines tend to be shorter in length. However, a significant proportion of both true and fake headlines exceed the typical context length of “classical” NLP neural network models, such as BERT. This raises concerns about the potential need for larger context windows to effectively identify fake news, as older models may not be equipped to handle longer headlines.

The upper quartile of token counts:

Show the code
print(f"True headlines: {round(true_headlines['text_token_count'].quantile(0.75))}")
print(f"Fake headlines: {round(fake_headlines['text_token_count'].quantile(0.75))}")
True headlines: 645
Fake headlines: 625


5 Baseline Model

A baseline model is built from heuristic rules derived from the common words with the largest difference in occurrence between true and fake news. The presence of a word more common in fake news in a given text adds a point, while the presence of a word more common in true news deducts a point. Texts with a final score greater than 0 are classified as fake news.

Baseline Predictions:

Show the code
def base_line_prediction(text):
    score = 0
    for word in words_common_fake:
        if word in text:
            score += 1
    for word in words_common_true:
        if word in text:
            score -= 1
    if score > 0:
        return 1
    else:
        return 0


baseline_predictions = val_data["text"].map_elements(base_line_prediction)

Baseline F1-Score:

Show the code
round(f1_score(val_data["is_fake"], baseline_predictions), 2)
0.46

6 Model Training

6.1 Naive Bayes

The code for Naive Bayes (NB) model training can be found in the naive_bayes.ipynb notebook.

The best options for representing the text data were determined during hyperparameter tuning. The results showed that:

  • For headline text data, the Naive Bayes model achieved its best F1-score of 0.88 when trained on bi-grams (pairs of consecutive words).
  • For title data, the highest F1-score of 0.64 was obtained when training the Naive Bayes model on single words.

Stacking the Naive Bayes models using Logistic Regression did not improve the results.
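For illustration, a bi-gram Naive Bayes text classifier of the kind described above can be sketched with scikit-learn. This is a minimal example with toy texts and labels, not the tuned pipeline from naive_bayes.ipynb:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Minimal sketch: a bag-of-bi-grams Naive Bayes classifier.
nb_text = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),  # pairs of consecutive words
    MultinomialNB(),
)
texts = [
    "the senate passed the budget bill on friday",
    "shocking secret they do not want you to know",
]
labels = [0, 1]  # 0 = true, 1 = fake (illustrative only)
nb_text.fit(texts, labels)
```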

6.2 Tree Based Models

The code for tree-based model training can be found inside trees.ipynb.

The next iteration of the fake news classifier involves text feature engineering and tree-based models. Several new features were added to the dataset, including:

  • Token count: The total number of tokens in the title and headline text of each item
  • Capital letter count: The number of capital letters per token in the headline text and title for each item
  • Sentiment features: Vader sentiment intensity features for each item
  • Naive Bayes probabilities: The probabilities generated by the previously described Naive Bayes models

Two types of models were trained on this expanded feature set: LightGBM with different algorithms and a single decision tree. Hyperparameter tuning was performed for each model to optimize their performance.

The best results were achieved with a LightGBM model that utilized the Dropouts meet Multiple Additive Regression Trees algorithm, with a learning rate of 0.06, a maximum of 71 leaves, and 163 estimators. An F1-score of 0.980 was achieved with this model, significantly outperforming the previous Naive Bayes models. The addition of capital letter count and text length features was found to be particularly effective in improving the model’s accuracy.

6.3 Neural Network Models

The code for neural-network model training can be found in the nn.ipynb notebook. Several experiments with neural network models were conducted.

Model Training Description
  • Data Loader. The data loader divides the data into training, validation, and test sets based on dates, as described above.

  • Loss Function. Binary cross-entropy was used.

  • Fine-tuning. Fine-tuning is triggered by a loss-delta criterion: the backbone layers are unfrozen and trained with a lower learning rate.

  • Early Stopping. Model training halts based on a loss-delta metric, and the model parameters from the best epoch are restored.

  • Logging. Experiments are logged using MLflow. Parameters, metrics, and models from each experiment are uploaded to Databricks (Community Edition).
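The loss-delta early-stopping logic can be sketched as follows. This is a simplified illustration; the parameter names (min_delta, patience) and values are assumptions, not the notebook's exact settings:

```python
# Sketch of loss-delta early stopping with best-epoch restoration.
def train_with_early_stopping(losses, min_delta=1e-3, patience=2):
    """Walk a sequence of validation losses; stop once no improvement
    larger than min_delta is seen for `patience` consecutive epochs.
    Returns the index of the best epoch (whose weights would be restored)."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(losses):
        if best_loss - loss > min_delta:  # meaningful improvement
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # patience exhausted: stop
                break
    return best_epoch
```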

6.4 Model Validation and Selection

Loading the models:

Show the code
val_data_with_features = joblib.load("temp/val_data_with_features.joblib")
model_lgbm = mlflow.lightgbm.load_model(
    "dbfs:/databricks/mlflow-tracking/293372627156913/e8804ac4e5174f518bce4df685509091/artifacts/model"
)
model_nb_text = mlflow.sklearn.load_model(
    "dbfs:/databricks/mlflow-tracking/293372627156913/43e3bc16fc704671bdd3bf76163c6c90/artifacts/model"
)
model_nb_title = mlflow.sklearn.load_model(
    "dbfs:/databricks/mlflow-tracking/293372627156913/716fd741c5ed45d78c6e3d4afcc3c3f5/artifacts/model"
)
model_bert = mlflow.pytorch.load_model(
    "dbfs:/databricks/mlflow-tracking/293372627156913/a7cc8827a0e047a2878c11a1bc9a29cc/artifacts/model"
)
model_bert.to('cuda')
model_bert.eval()

data_merged_text_title = merged_data.with_columns(
    pl.concat_str([pl.col("title"), pl.col("text")], separator=". ").alias("text")
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
data_module = NewsDataModule(data_merged_text_title, tokenizer)
data_module.setup()
clear_output()

BERT Predictions:

Show the code
val_data_loader = data_module.val_dataloader()
batch_iterator = iter(val_data_loader)
bert_predictions = []
validation_true_vals = []
device = torch.device("cuda")
for _ in range(len(val_data_loader)):
    batch = next(batch_iterator)
    with torch.no_grad():
        outputs = model_bert(
            batch["input_ids"].to("cuda"), batch["attention_mask"].to("cuda")
        )
    bert_predictions.append(outputs)
    validation_true_vals.append(batch["label"].to("cuda"))
    torch.cuda.empty_cache()

LightGBM model F1-Score on the validation set:

Show the code
preds_lgbm = model_lgbm.predict_proba(
    val_data_with_features.drop(
        ["title", "text", "subject", "language", "date", "is_fake"]
    )
)
clear_output()
round(f1_score(val_data_with_features["is_fake"], preds_lgbm[:, 1] >= 0.5), 3)
0.98

BERT model F1-Score on the validation set:

Show the code
bert_predictions_np = torch.cat(bert_predictions).sigmoid().cpu().numpy()
validation_true_vals_np = torch.cat(validation_true_vals).cpu().numpy()
round(f1_score(validation_true_vals_np, bert_predictions_np >= 0.5), 3)
0.986

While the BERT model's complexity was significantly higher, it outperformed the naive Bayes and engineered-feature tree models by a mere 0.006 F1-score points, so the added complexity brought only a marginal improvement.

6.5 Adversarial Testing

In order to evaluate to what extent the models would perform if the global news context were to change, the names of the two most commonly mentioned politicians were altered in the validation data.

Changing the names of politicians:

Show the code
def change_candidates(text):
    text = (
        text.replace("Donald", "John")
        .replace("donald", "john")
        .replace("DONALD", "JOHN")
    )
    text = text.replace("Trump", "Doe").replace("trump", "doe").replace("TRUMP", "DOE")
    text = (
        text.replace("Hillary", "Jane")
        .replace("hillary", "jane")
        .replace("HILLARY", "JANE")
    )
    text = (
        text.replace("Clinton", "Doe")
        .replace("clinton", "doe")
        .replace("CLINTON", "DOE")
    )
    return text


val_data_adverse = val_data_with_features.with_columns(
    pl.col("text").map_elements(change_candidates)
)

val_data_adverse = val_data_adverse.with_columns(
    pl.col("title").map_elements(change_candidates)
)

Getting Naive Bayes probabilities on the changed data set:

Show the code
nb_preds_text = model_nb_text.predict_proba(val_data_adverse["text"])[:, 1]
nb_preds_title = model_nb_title.predict_proba(val_data_adverse["title"])[:, 1]
val_data_adverse = val_data_adverse.with_columns(
    pl.Series("nbayes_text_probability", nb_preds_text)
)
val_data_adverse = val_data_adverse.with_columns(
    pl.Series("nbayes_title_probability", nb_preds_title)
)

LightGBM F1-Score on the dataset with changed names:

Show the code
preds_lgbm = model_lgbm.predict_proba(
    val_data_adverse.drop(["title", "text", "subject", "language", "date", "is_fake"])
)
clear_output()
round(f1_score(val_data_with_features["is_fake"], preds_lgbm[:, 1] > 0.5), 3)
0.979

BERT model F1-score on data with changed names:

Show the code
data_merged_text_title_ad = data_merged_text_title.with_columns(
    pl.col("text").map_elements(change_candidates)
)
data_module = NewsDataModule(data_merged_text_title_ad, tokenizer)
data_module.setup()
val_data_loader = data_module.val_dataloader()
batch_iterator = iter(val_data_loader)
bert_predictions = []
validation_true_vals = []
device = torch.device("cuda")
model_bert.to(device)
model_bert.eval()
for _ in range(len(val_data_loader)):
    batch = next(batch_iterator)
    with torch.no_grad():
        outputs = model_bert(
            batch["input_ids"].to("cuda"), batch["attention_mask"].to("cuda")
        )
    bert_predictions.append(outputs)
    validation_true_vals.append(batch["label"].to("cuda"))
    torch.cuda.empty_cache()

bert_preds_ad_np = torch.cat(bert_predictions).sigmoid().cpu().numpy()
val_true_vals_ad_np = torch.cat(validation_true_vals).cpu().numpy()
round(f1_score(val_true_vals_ad_np, bert_preds_ad_np >= 0.5), 3)
0.984

Both models lost only 0.001 F1-score when the candidates’ names were changed. The BERT model will be used from here on, as it slightly outperforms the other models.

6.5.1 Decision Threshold Optimization

Show the code
fig_pr, ax_pr = plt.subplots(figsize=BASE_FIG_SIZE)
thresholds = np.linspace(0.0, 0.95, num=45)
precisions, recalls, f1_scores = [], [], []
for threshold in thresholds:
    binary_preds = bert_predictions_np >= threshold
    precisions.append(
        precision_score(validation_true_vals_np, binary_preds, zero_division=np.nan)
    )
    recalls.append(
        recall_score(validation_true_vals_np, binary_preds, zero_division=np.nan)
    )
    f1_scores.append(
        f1_score(validation_true_vals_np, binary_preds, zero_division=np.nan)
    )
max_f1 = np.nanmax(f1_scores)
ax_pr.plot(thresholds, precisions, label="Precision")
ax_pr.plot(thresholds, recalls, label="Recall")
ax_pr.plot(thresholds, f1_scores, label="F1")
ax_pr.set_xlabel("Threshold")
ax_pr.set_ylabel("Score")
ax_pr.legend(loc="lower center")
ax_pr.set_ylim((0, 1.18))
best_threshold = thresholds[np.nanargmax(f1_scores)]
ax_pr.annotate(
    f"Support: {int(validation_true_vals_np.sum())}\nMax F1-score: {max_f1:.3f}\nBest Threshold: {best_threshold:.2f}",
    (0, 1.02),
)
plt.show()

Figure 10: The precision, recall and F1 scores of the BERT model’s predictions with different decision thresholds.

The model maintains extremely high recall across the entire threshold range, which means the decision threshold can be raised as high as 0.95 to gain precision and F1-score.
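The manual threshold grid above can also be expressed with scikit-learn’s precision_recall_curve, which evaluates every distinct predicted probability as a candidate threshold. A minimal sketch on toy data (the arrays below stand in for the real validation outputs):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy probabilities and labels standing in for the BERT validation outputs.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6])

# precision_recall_curve tries every distinct score as a threshold,
# so the best-F1 threshold falls out without a manual grid search.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = thresholds[np.argmax(f1[:-1])]  # the final PR point has no threshold
print(best)
# 0.7
```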

7 Model Final Evaluation and Interpretation

Making predictions on the test set:

Show the code
data_module = NewsDataModule(data_merged_text_title, tokenizer, batch_size=32)
data_module.setup()
test_data_loader = data_module.test_dataloader()
batch_iterator = iter(test_data_loader)
bert_predictions = []
validation_true_vals = []
device = torch.device("cuda")
model_bert.to(device)
model_bert.eval()
for _ in range(len(test_data_loader)):
    batch = next(batch_iterator)
    with torch.no_grad():
        outputs = model_bert(
            batch["input_ids"].to("cuda"), batch["attention_mask"].to("cuda")
        )
    bert_predictions.append(outputs)
    validation_true_vals.append(batch["label"].to("cuda"))
    torch.cuda.empty_cache()

bert_preds_test = torch.cat(bert_predictions).sigmoid().cpu().numpy()
test_true_vals = torch.cat(validation_true_vals).cpu().numpy()

Classification Metrics:

Show the code
print(classification_report(test_true_vals, bert_preds_test >= best_threshold, digits=3))
              precision    recall  f1-score   support

           0      1.000     0.999     0.999      5489
           1      0.991     0.998     0.995       580

    accuracy                          0.999      6069
   macro avg      0.996     0.999     0.997      6069
weighted avg      0.999     0.999     0.999      6069

The model achieved an F1-score of 0.995 on the test set, which is even better than on the validation set.

Confusion matrix:

Show the code
fig_cm, ax_cm = plt.subplots(figsize=BASE_FIG_SIZE)
conf_mat = confusion_matrix(test_true_vals, bert_preds_test >= 0.95)
sns.heatmap(conf_mat, annot=True, fmt="d", cmap="Blues", ax=ax_cm)
ax_cm.set_xlabel("Predicted Labels")
ax_cm.set_ylabel("True Labels")
ax_cm.set_yticklabels(["True", "Fake"])
ax_cm.set_xticklabels(["True", "Fake"])
ax_cm.set_title("Confusion Matrix")
plt.show()

Figure 11: The confusion matrix of BERT model’s predictions on the test set.

The model made only one Type I error and five Type II errors.
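Error counts of this kind can be read directly off the confusion matrix. A small sketch with toy labels (with label 1 = fake, a false positive is a Type I error, i.e. true news flagged as fake, and a false negative is a Type II error, i.e. fake news passed as true):

```python
from sklearn.metrics import confusion_matrix

# Toy labels illustrating how to unpack the four cells of a binary
# confusion matrix; ravel() returns them in (tn, fp, fn, tp) order.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # fp = Type I errors, fn = Type II errors
# 1 1
```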

7.1 Lime Explanations

Locating the misclassified test samples:

Show the code
np.where(~np.equal(test_true_vals, bert_preds_test >= 0.95))
(array([ 273, 4074, 4556, 5024, 5432, 5459]),)

Correct Fake News Identification:

Show the code
explainer = LimeTextExplainer(class_names=['True','Fake'])
explanation = explainer.explain_instance(
    test_data_loader.dataset.texts[0],
    lambda x: predict_probabilities(x, model=model_bert),
    num_samples=2024
)

explanation.show_in_notebook()

Figure 12: Lime explanation of a correct Fake news prediction.

Correct True News Identification:

Show the code
explanation = explainer.explain_instance(
    test_data_loader.dataset.texts[-1],
    lambda x: predict_probabilities(x, model=model_bert),
    num_samples=2024
)

explanation.show_in_notebook()

Figure 13: Lime explanation of a correct True news prediction.

Type I error:

Show the code
explanation = explainer.explain_instance(
    test_data_loader.dataset.texts[273],
    lambda x: predict_probabilities(x, model=model_bert),
    num_samples=2024
)

explanation.show_in_notebook()

Figure 14: Lime explanation of an erroneous True news prediction.

Type II error:

Show the code
explanation = explainer.explain_instance(
    test_data_loader.dataset.texts[4074],
    lambda x: predict_probabilities(x, model=model_bert),
    num_samples=2024
)

explanation.show_in_notebook()

Figure 15: Lime explanation of an erroneous Fake news prediction.

Overall, the model’s predictive performance is impressive, even though the specific features driving its decisions are hard to pin down. In fake news detection, where adversaries constantly try to evade detection, this opacity may actually be an asset: a model that relies on features that are not immediately apparent is harder to manipulate, making it a more effective tool in the fight against disinformation.

8 Further Considerations and Potential Improvements

The difficult nature of fake news detection, coupled with the relentless efforts of troll farms and the dynamic global news landscape, underscores the need for continuous model updates. In light of these challenges, it is advisable to deploy a suite of models leveraging diverse features and update them regularly to stay ahead of the curve.

For the model presented in this project, several areas of immediate improvement are identified:

  • Incorporating language models with larger context windows could improve prediction reliability, as most texts exceed BERT’s 512-token limit. A Longformer with an additional 100 tokens of context already showed promising results.
  • Developing additional models with comparable performance, but based on distinct text features, would create a robust news filtering system.
  • More comprehensive adversarial testing of the models is recommended to simulate real-world scenarios.
  • A more diverse dataset should be employed in the future to mitigate over-fitting on specific writing styles, as most true news samples originated from a single source.

9 Conclusions

  • The dataset consisted of 45K news items, with 9K removed due to errors or unintelligibility. The cleaned dataset contained 15K true news items and 20K fake news items.
  • The distribution of true and fake news was uneven over time, with a time split used to create a balanced training set and more true news in the validation and test sets.
  • Analysis of the text revealed that common words in both fake and true news included people and locations, but fake news had more names, adverbs, and prepositions, while true news had more verbs.
  • The headline text was up to 2000 tokens long, with the upper quartile being 645 for true and 625 for fake news.
  • A baseline F1-score of 0.46 was achieved using a heuristic approach, while a lightweight model using LightGBM and Naive Bayes probabilities achieved an F1-score of 0.980.
  • The best-performing model was a BERT-based model with a custom classifier head, achieving an F1-score of 0.995 on the test set.
  • The model exhibited excellent recall and could tolerate a decision threshold of 0.95 with minimal loss of recall.
  • Adversarial testing showed that the model’s performance was marginally affected by changes in the global news context.
  • The features the model relies on are difficult for humans to identify, which could make the model harder for adversaries to break.